Urs Niesen and Aslan Tchamkerten



Abstract 


A novel quickest detection setting is proposed which is a generalization of the well-known Bayesian change-point detection model. Suppose $\{(X_i, Y_i)\}_{i \ge 1}$ is a sequence of pairs of random variables, and that $S$ is a stopping time with respect to $\{X_i\}_{i \ge 1}$. The problem is to find a stopping time $T$ with respect to $\{Y_i\}_{i \ge 1}$ that optimally tracks $S$, in the sense that $T$ minimizes the expected reaction delay $\mathbb{E}(T - S)^+$, while keeping the false-alarm probability $\mathbb{P}(T < S)$ below a given threshold $\alpha \in [0, 1]$. This problem formulation applies in several areas, such as communication, detection, forecasting, and quality control.

Our results relate to the situation where the $X_i$'s and $Y_i$'s take values in finite alphabets and where $S$ is bounded by some positive integer $\kappa$. By using elementary methods based on the analysis of the tree structure of stopping times, we exhibit an algorithm that computes the optimal average reaction delays for all $\alpha \in [0, 1]$, and constructs the associated optimal stopping times $T$. Under certain conditions on $\{(X_i, Y_i)\}_{i \ge 1}$ and $S$, the algorithm running time is polynomial in $\kappa$.
I. Problem Statement



The tracking stopping time (TST) problem is defined as follows. Let $\{(X_i, Y_i)\}_{i \ge 1}$ be a sequence of pairs of random variables. Alice observes $X_1, X_2, \ldots$ and chooses a stopping time (s.t.) $S$ with respect to that sequence.^1 Knowing the distribution of $\{(X_i, Y_i)\}_{i \ge 1}$ and the stopping rule $S$, but having access only to the $Y_i$'s, Bob wishes to find a s.t. that gets as close as possible to Alice's. Specifically, Bob aims to find a s.t. $T$ minimizing the expected reaction delay $\mathbb{E}(T - S)^+ = \mathbb{E}\max\{0, T - S\}$, while keeping the false-alarm probability $\mathbb{P}(T < S)$ below a certain threshold $\alpha \in [0, 1]$.
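The two quantities in the problem statement can be evaluated exactly for small instances by enumerating all sequence pairs. The Python sketch below does this for a toy model that is our own assumption, not taken from the text: the $X_i$ are i.i.d. Bernoulli(1/2), each $Y_i$ equals $X_i$ flipped with probability `p`, and Alice's rule is $S = 1$ if $X_1 = 1$ and $S = \kappa$ otherwise. The names `evaluate`, `T_wait` and `T_track` are hypothetical.

```python
from itertools import product

def evaluate(S, T, kappa, p):
    """Return (E(T-S)^+, P(T<S)) by enumerating all length-kappa pairs (x, y).

    Assumed toy model: X_i i.i.d. Bernoulli(1/2), Y_i = X_i flipped w.p. p.
    S maps an x-sequence and T a y-sequence to a stopping value in 1..kappa.
    """
    delay = 0.0
    false_alarm = 0.0
    for x in product((0, 1), repeat=kappa):
        for y in product((0, 1), repeat=kappa):
            prob = 1.0
            for xi, yi in zip(x, y):
                prob *= 0.5 * ((1 - p) if xi == yi else p)
            s, t = S(x), T(y)
            delay += prob * max(0, t - s)      # reaction delay (T - S)^+
            false_alarm += prob * (t < s)      # event {T < S}
    return delay, false_alarm

kappa, p = 3, 0.1
S = lambda x: 1 if x[0] == 1 else kappa        # Alice's (assumed) rule
T_wait = lambda y: kappa                       # Bob always waits until kappa
T_track = lambda y: 1 if y[0] == 1 else kappa  # Bob mimics Alice on Y_1

print(evaluate(S, T_wait, kappa, p))
print(evaluate(S, T_track, kappa, p))
```

Waiting until $\kappa$ never raises a false alarm but pays the full delay, whereas mimicking Alice on the noisy observation trades a small false-alarm probability for a much shorter delay; this is the tradeoff the TST problem optimizes.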

Example 1. Monitoring

Let $X_i$ be the distance of an object from a barrier at time $i$, and let $S$ be the first time the object hits the barrier, i.e., $S = \inf\{i \ge 1 : X_i = 0\}$. Assume we have access to $X_i$ only through a noisy measurement $Y_i$, and that we want to raise an alarm as soon as the object hits the barrier. This problem can be formulated as the one of finding a s.t. $T$ with respect to the $Y_i$'s that minimizes the expected reaction delay $\mathbb{E}(T - S)^+$, while keeping the false-alarm probability $\mathbb{P}(T < S)$ small enough.

Another situation where the TST problem applies is in the context of communication over channels 
with feedback. Most of the studies related to feedback communication assume perfect feedback, i.e., the 
transmitter is fully aware of the output of the channel as observed by the receiver. Without this assumption 
— i.e., if the feedback link is noisy — a synchronization problem may arise between the transmitter and 
the receiver which can be formulated as a TST problem, as shown in the following example. 

Example 2. Communication 

The authors are with the Massachusetts Institute of Technology, Cambridge, MA 02139. Email: {uniesen,tcham}@mit.edu
This work was supported in part by NSF under Grant No. CCF-0515122, by DoD MURI Grant No. N00014-07-1-0738, and by a
University IR&D Grant from Draper Laboratory.

1 Recall that a stopping time with respect to a sequence of random variables $\{X_i\}_{i \ge 1}$ is a random variable $S$ taking values in the positive integers such that the event $\{S = n\}$, conditioned on $\{X_i\}_{i=1}^{n}$, is independent of $\{X_i\}_{i=n+1}^{\infty}$ for all $n \ge 1$. A stopping time $S$ is non-randomized if $\mathbb{P}(S = n | X^n = x^n) \in \{0, 1\}$ for all $x^n \in \mathcal{X}^n$ and $n \ge 1$. A stopping time $S$ is randomized if $\mathbb{P}(S = n | X^n = x^n) \in [0, 1]$ for all $x^n \in \mathcal{X}^n$ and $n \ge 1$.

Fig. 1. The decoding time $S$ depends on the output of the forward channel. The encoder decides to stop transmission at time $T$ based on the output of the feedback channel. If the feedback channel is noisy, $S$ and $T$ need not coincide.



It is well known that the presence of a noiseless feedback link makes it possible to increase dramatically the reliability for a given communication delay (see, e.g., [1]). However, to take advantage of feedback, variable length codes are often necessary.^2 This can be observed by looking at a non-perfect binary erasure channel. In this case, any block coding strategy yields a strictly positive error probability. In contrast, consider the variable length strategy where the encoder keeps sending the bit it wishes to convey until it is successfully received. This simple strategy achieves error-free communication at a rate equal to the capacity of the channel in question. Can we still use this coding strategy if the feedback channel is (somewhat) noisy? Because of the noisy feedback link, a synchronization problem between the decoder and the encoder arises: if the first non-erased output symbol occurs at time $S$, what should be sent at time $S + 1$? This agreement problem occurs because the encoder now observes only a noisy version of the symbols received by the decoder (see Fig. 1). In particular, the first non-erased output symbol may not be recognized as such by the encoder.^3

Instead of treating the synchronization issue that results from the use of variable length codes over channels with noisy feedback, let us consider the simpler problem of finding the minimum delay needed by the encoder to realize that the decoder has made a decision. In terms of the TST problem, Alice and Bob represent the decoder and the encoder, the $X_i$'s and $Y_i$'s correspond to the input and output symbols of the feedback channel, whereas $S$ and $T$ represent the decoding time and the time the encoder stops transmission, respectively. Here $\mathbb{E}(T - S)^+$ represents the delay needed by the encoder to realize that the decoder has made a decision, and we aim to minimize it given that the probability of stopping transmission too early, $\mathbb{P}(T < S)$, is kept below a certain threshold $\alpha$.

Note that, in the context of feedback communication, it would be reasonable to define the communication rate with respect to the overall delay $S + (T - S)^+ = \max\{S, T\}$. This definition, in contrast with the one that takes into account only the decoding time (such as for rateless codes), puts the delay constraint on both the transmitter and the receiver. In Example 8 we investigate the highest achievable rate with respect to the overall communication delay if the "send until a non-erasure occurs" strategy is used and both the forward and the feedback channels are binary erasure.

Example 3. Forecasting 

A large manufacturing machine breaks down as soon as its cumulative fatigue hits a certain threshold. 
Knowing that a machine replacement takes, say, ten days, the objective is to order a new machine so 
that it is operational at the time the old machine breaks down. This prevents losses due to an interrupted 
manufacturing process as well as storage costs caused by an unused backup machine. 

The problem of determining the operating start date of the new machine can be formulated as follows.
Let $X_n$ be the cumulative fatigue up to day $n$ of the current machine, and let $S$ denote the first day $n$
that $X_n$ crosses the critical fatigue threshold. Since the replacement period is ten days, the first day $T$ a

2 The reliability function associated with block coding schemes is lower than the one associated with variable length coding. For symmetric 
channels, for instance, the reliability function associated with block coding schemes is limited by the sphere packing bound, which is lower 
than the best optimal error exponent attainable with variable length coding ([2], [3]). 

3 For fixed length coding strategies over channels with noisy feedback we refer the reader to [4], [5]. 
new machine is operational can be scheduled only on the basis of a (possibly randomized) function of $\{X_i\}_{i=1}^{T-10}$. By defining $Y_i$ to be equal to $X_{i-10}$ if $i > 10$ and else equal to zero, the day $T$ is now a s.t. with respect to $\{Y_i\}_{i \ge 1}$, and we can formulate the requirement on $T$ as aiming to minimize $\mathbb{E}(T - S)^+$ while keeping $\mathbb{P}(T < S)$ below a certain threshold.

Note that, in the forecasting example, in contrast with the monitoring and communication examples, Alice has access to more information than Bob. From the process she observes, she can deduce Bob's observations, here simply by delaying hers. This feature may be interesting in other applications. The general formulation where Alice has access to more information than Bob is obtained by letting the observation available to Alice at time $i$ be $\tilde{X}_i = (X_i, Y_i)$, and the observation available to Bob be $\tilde{Y}_i = Y_i$.

Example 4. Bayesian Change-Point Detection 

In this Example we will see how the TST setting generalizes the Bayesian version of the change-point detection problem, a long studied problem with applications to industrial quality control and that dates back to the 1940's [6]. The Bayesian change-point problem is formulated as follows. Let $\theta$ be a random variable taking values in the positive integers. Let $\{Y_i\}_{i \ge 1}$ be a sequence of random variables such that, given the value of $\theta$, the conditional probability of $Y_n$ given $Y^{n-1} = \{Y_i\}_{i=1}^{n-1}$ is $P_0(\cdot | Y^{n-1})$ for $n < \theta$ and is $P_1(\cdot | Y^{n-1})$ for $n \ge \theta$. We are interested in a s.t. $T$ with respect to the $Y_i$'s minimizing the change-point reaction delay $\mathbb{E}(T - \theta)^+$, while keeping the false-alarm probability $\mathbb{P}(T < \theta)$ below a certain threshold $\alpha \in [0, 1]$.

Shiryaev (see, e.g., [7], [8, Chapter 4.3]) considered the Lagrangian formulation of the above problem: Given a constant $\lambda \ge 0$, minimize

$$\mathbb{E}(T - \theta)^+ + \lambda \mathbb{P}(T < \theta)$$

over all s.t.'s $T$. Assuming a geometric prior on the change-point $\theta$ and that before and after $\theta$ the observations are independent with common density function $f_0$ for $t < \theta$ and $f_1$ for $t \ge \theta$, Shiryaev showed that the optimal $T$ stops as soon as the posterior probability that a change occurred exceeds a certain fixed threshold. Later Yakir [9] generalized Shiryaev's result by considering finite-state Markov chains. For more general prior distributions on $\theta$, the problem is known to become difficult to handle. However, in the limit $\alpha \to 0$, Lai [10] and, later, Tartakovsky and Veeravalli [11], derived asymptotically optimal detection policies for the Bayesian change-point problem under general assumptions on the distributions of the change-point and observed process.^4

To see that the Bayesian change-point problem can be formulated as a TST problem, it suffices to define the sequence of binary random variables $\{X_i\}_{i \ge 1}$ such that $X_i = 0$ if $i < \theta$ and $X_i = 1$ if $i \ge \theta$, and to let $S = \inf\{i : X_i = 1\}$ (i.e., $S = \theta$). The change-point problem defined by $\theta$ and $\{Y_i\}_{i \ge 1}$ becomes the TST problem defined by $S$ and $\{(X_i, Y_i)\}_{i \ge 1}$. However, the TST problem cannot, in general, be formulated as a Bayesian change-point problem. Indeed, the Bayesian change-point problem yields for any $k > n$

$$\mathbb{P}(\theta = k | Y^n = y^n, \theta > n) = \frac{\mathbb{P}(Y^n = y^n, \theta > n | \theta = k)\,\mathbb{P}(\theta = k)}{\mathbb{P}(Y^n = y^n | \theta > n)\,\mathbb{P}(\theta > n)} = \frac{\mathbb{P}(Y^n = y^n | \theta = k)\,\mathbb{P}(\theta = k)}{\mathbb{P}(Y^n = y^n | \theta > n)\,\mathbb{P}(\theta > n)} = \mathbb{P}(\theta = k | \theta > n) \qquad (1)$$

since $\mathbb{P}(Y^n = y^n | \theta = k) = \mathbb{P}(Y^n = y^n | \theta > n)$. Therefore, conditioned on the event $\{\theta > n\}$, the first $n$ observations $Y^n$ are independent of $\theta$. In other words, given that no change occurred up to time $n$, the observations $y^n$ are useless in predicting the value of the change-point $\theta$. In contrast, for the TST problem, in general we have

$$\mathbb{P}(S = k | Y^n = y^n, S > n) \ne \mathbb{P}(S = k | S > n) \qquad (2)$$

4 For the non-Bayesian version of the change-point problem we refer the reader to [12], [13], [14]. 
Fig. 2. Typical shape of the expected delay $d(\alpha)$ as a function of false-alarm probability $\alpha$. The break-points are achieved by non-randomized stopping times.

because $\mathbb{P}(Y^n = y^n | S = k) \ne \mathbb{P}(Y^n = y^n | S > n)$.

As is argued in the last example, the TST problem is a generalization of the Bayesian change-point problem, which itself is analytically tractable only in special cases. This makes an analytical treatment of the general TST problem difficult. Instead, we present an algorithmic solution to this problem for an arbitrary process $\{(X_i, Y_i)\}_{i \ge 1}$ and an arbitrary stopping time $S$ bounded by some constant $\kappa \ge 1$. The proof of correctness of this algorithm provides insights into the structure of the optimal stopping time $T$ tracking $S$, and into the tradeoff between expected delay $\mathbb{E}(T - S)^+$ and probability of false-alarm $\mathbb{P}(T < S)$. Under some conditions on $\{(X_i, Y_i)\}_{i \ge 1}$ and $S$, the computational complexity of this algorithm is polynomial in $\kappa$.

The rest of the paper is organized as follows. In Section II we provide some basic properties of the TST problem defined over a finite alphabet process $\{(X_i, Y_i)\}_{i \ge 1}$, and in Section III we provide an algorithmic solution to it. In Section IV we derive conditions under which the algorithm has low complexity and illustrate this in Section V with examples.



II. The Optimization Problem 

Let $\{(X_i, Y_i)\}_{i \ge 1}$ be a discrete-time process where the $X_i$'s and $Y_i$'s take values in some finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively. Let $S$ be a s.t. with respect to $\{X_i\}_{i \ge 1}$ such that $S \le \kappa$ almost surely for some constant $\kappa \ge 1$. We aim to find for any $\alpha \in [0, 1]$

$$d(\alpha) \triangleq \min_{T:\ \mathbb{P}(T < S) \le \alpha,\ T \le \kappa} \mathbb{E}(T - S)^+ \qquad (3)$$

where the s.t.'s $T$ are possibly randomized. Note that the restriction $T \le \kappa$ induces no loss of optimality.

Now, the set of all s.t.'s over $\{Y_i\}_{i \ge 1}$ is convex, and its extreme points are non-randomized s.t.'s ([15], [16]). This implies that any randomized s.t. $T \le \kappa$ can be written as a convex combination of non-randomized s.t.'s bounded by $\kappa$, i.e.

$$\mathbb{P}(T = k) = \sum_j w_j \mathbb{P}(T_j = k)$$

for any integer $k$, where $\{T_j\}$ denotes the finite set of all non-randomized s.t.'s bounded by $\kappa$, and where the $w_j$'s are nonnegative and sum to one. Hence, because false-alarm and expected reaction delay can be written as

$$\mathbb{P}(T < S) = \sum_j w_j \mathbb{P}(T_j < S)$$

$$\mathbb{E}(T - S)^+ = \sum_j w_j \mathbb{E}(T_j - S)^+$$

the function $d(\alpha)$ is convex and piecewise linear, with break-points achieved by non-randomized s.t.'s. Its typical shape is depicted in Figure 2.
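Because $d(\alpha)$ is piecewise linear, a target $\alpha$ that falls between two break-points is met by randomizing between the two neighboring non-randomized s.t.'s. The Python sketch below illustrates this interpolation. The break-point values are hypothetical numbers chosen for illustration, and the function name `randomized_delay` is an assumption of this sketch.

```python
def randomized_delay(alpha, breakpoints):
    """Interpolate d(alpha) between break-points (alpha_m, d_m).

    Returns (delay, w): w is the probability of using the s.t. at the left
    break-point, and 1 - w that of using the s.t. at the right break-point.
    Assumes distinct alpha_m values.
    """
    pts = sorted(breakpoints)
    if alpha < pts[0][0]:
        raise ValueError("alpha lies below the smallest break-point")
    for (a0, d0), (a1, d1) in zip(pts, pts[1:]):
        if a0 <= alpha <= a1:
            w = (a1 - alpha) / (a1 - a0)
            return w * d0 + (1 - w) * d1, w
    # Beyond the last break-point the false-alarm constraint is inactive:
    # the last s.t. is used with probability one.
    return pts[-1][1], 1.0

# Hypothetical break-points (alpha_m, d_m); not results from the paper.
points = [(0.0, 1.0), (0.05, 0.1), (0.5, 0.0)]
delay, w = randomized_delay(0.025, points)
print(delay, w)
```

With these numbers, a threshold halfway between the first two break-points is met by tossing a fair coin between the two corresponding non-randomized s.t.'s.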
For $\lambda \ge 0$, define the Lagrangian

$$J_\lambda(T) \triangleq \mathbb{E}(T - S)^+ + \lambda \mathbb{P}(T < S). \qquad (4)$$

Lemma 1. We have

$$d(\alpha) = \sup_{\lambda \ge 0} \min_{T \le \kappa} \left( J_\lambda(T) - \lambda\alpha \right),$$

where the minimization is over all non-randomized s.t.'s bounded by $\kappa$.

Proof. The convex minimization problem in (3) admits at least one feasible point, namely $T = \kappa$. Therefore strong Lagrange duality holds (see, e.g., [17, Chapter 5]), and we obtain

$$d(\alpha) = \sup_{\lambda \ge 0} \min_{T \le \kappa} \left( J_\lambda(T) - \lambda\alpha \right). \qquad (5)$$

Because $d(\alpha)$ is convex with extreme points achieved by non-randomized s.t.'s, we may restrict the minimization in (5) to be over the set of non-randomized s.t.'s bounded by $\kappa$. □

III. An Algorithm for Computing d(a) 

We first establish a few preliminary results later used to evaluate $\min_T J_\lambda(T)$. Emphasis is put on the finite tree representation of bounded s.t.'s with respect to finite alphabet processes. We then provide an algorithm that computes the entire curve $d(\alpha)$.

We introduce a few notational conventions. The set $\mathcal{Y}^*$ represents all finite sequences over $\mathcal{Y}$. An element in $\mathcal{Y}^*$ is denoted either by $y^n$ or by $y$, depending on whether or not we want to emphasize its length. To any non-randomized s.t. $T$, we associate a unique $|\mathcal{Y}|$-ary tree $\mathcal{T}$ (i.e., all the nodes of $\mathcal{T}$ have either zero or exactly $|\mathcal{Y}|$ children) having each node specified by some $y \in \mathcal{Y}^*$, where $\rho y$ represents the vertex path from the root $\rho$ to the node $y$. The depth of a node $y^n \in \mathcal{T}$ is denoted by $l(y^n) = n$. The tree consisting only of the root is the trivial tree. A node $y^n \in \mathcal{T}$ is a leaf if $\mathbb{P}(T = n | Y^n = y^n) = 1$. We denote by $\mathcal{L}(\mathcal{T})$ the leaves of $\mathcal{T}$ and by $\mathcal{I}(\mathcal{T})$ the intermediate (or non-terminal) nodes of $\mathcal{T}$. The notation $T(\mathcal{T})$ is used to denote the (non-randomized) s.t. $T$ induced by the tree $\mathcal{T}$. Given a node $y$ in $\mathcal{T}$, let $\mathcal{T}_y$ be the subtree of $\mathcal{T}$ rooted in $y$. Finally, let $\mathcal{D}(\mathcal{T}_y)$ denote the descendants of $y$ in $\mathcal{T}$. The next example illustrates these notations.

Example 5. Let $\mathcal{Y} = \{0, 1\}$ and $\kappa = 2$. The tree $\mathcal{T}$ depicted in Figure 3 corresponds to the non-randomized s.t. $T$ taking value one if $Y_1 = 1$ and value 2 if $Y_1 = 0$. The sets $\mathcal{L}(\mathcal{T})$ and $\mathcal{I}(\mathcal{T})$ are given by $\{00, 01, 1\}$ and $\{\rho, 0\}$, respectively. The subtree $\mathcal{T}_0$ of $\mathcal{T}$ consists of the nodes $\{0, 00, 01\}$, and its descendants $\mathcal{D}(\mathcal{T}_0)$ are $\{00, 01\}$. The subtree $\mathcal{T}_\rho$ is the same as $\mathcal{T}$, and its descendants $\mathcal{D}(\mathcal{T}_\rho)$ are $\{0, 1, 00, 01\}$.

Fig. 3. Tree corresponding to the s.t. $T$ defined by $T = 1$ if $Y_1 = 1$, and $T = 2$ else.
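The notation of Example 5 maps directly onto a small data structure. In the Python sketch below, a tree is a set of strings over the alphabet and the root $\rho$ is the empty string; this encoding and the helper names are assumptions of the sketch, not part of the text.

```python
def leaves(tree):
    """L(T): nodes with no proper extension in the tree."""
    return {y for y in tree
            if not any(z != y and z.startswith(y) for z in tree)}

def internal(tree):
    """I(T): intermediate (non-terminal) nodes."""
    return tree - leaves(tree)

def subtree(tree, y):
    """T_y: the subtree rooted in y (y itself plus everything below it)."""
    return {z for z in tree if z.startswith(y)}

def descendants(tree, y):
    """D(T_y): nodes strictly below y."""
    return subtree(tree, y) - {y}

# Tree of Figure 3: T = 1 if Y_1 = 1, and T = 2 otherwise; "" is the root.
tree = {"", "0", "1", "00", "01"}

print(sorted(leaves(tree)))
print(sorted(internal(tree)))
print(sorted(descendants(tree, "0")))
```

The three printed sets correspond to $\mathcal{L}(\mathcal{T})$, $\mathcal{I}(\mathcal{T})$ and $\mathcal{D}(\mathcal{T}_0)$ of Example 5.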

Below, we describe an algorithm that, for a given s.t. $S$, constructs a sequence of s.t.'s $\{T(\mathcal{T}^m)\}_{m=0}^{M}$ and Lagrange multipliers $\{\lambda_m\}_{m=0}^{M}$ with the following two properties. First, the $\mathcal{T}^m$'s and $\lambda_m$'s are ordered in the sense that $\mathcal{T}^M \subset \mathcal{T}^{M-1} \subset \cdots \subset \mathcal{T}^0$ and $0 = \lambda_M \le \lambda_{M-1} \le \cdots \le \lambda_1 \le \lambda_0 = \infty$. (Here the symbol $\subset$ denotes inclusion, not necessarily strict.) Second, for any $m \in \{1, \ldots, M\}$ and $\lambda \in (\lambda_m, \lambda_{m-1}]$ the tree $\mathcal{T}^{m-1}$ minimizes $J_\lambda(\mathcal{T}) \triangleq J_\lambda(T(\mathcal{T}))$ among all non-randomized s.t.'s. The algorithm builds upon ideas from the CART algorithm for the construction of classification and regression trees [18].

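Before we state the algorithm, we need to introduce a few quantities. Given a non-randomized s.t. $T$ represented by its $|\mathcal{Y}|$-ary tree $\mathcal{T}$, we write the Lagrangian $J_\lambda(T)$ as

$$J_\lambda(T) = \mathbb{E}(T - S)^+ + \lambda \mathbb{P}(T < S)$$

$$= \sum_{y \in \mathcal{L}(\mathcal{T})} \mathbb{P}(Y = y) \Big( \mathbb{E}\big((l(y) - S)^+ \big| Y = y\big) + \lambda \mathbb{P}\big(S > l(y) \big| Y = y\big) \Big)$$

$$= \sum_{y \in \mathcal{L}(\mathcal{T})} b(y) + \lambda a(y)$$

$$= \sum_{y \in \mathcal{L}(\mathcal{T})} J_\lambda(y),$$

where

$$a(y) \triangleq \mathbb{P}(Y = y)\,\mathbb{P}(S > l(y) | Y = y),$$

$$b(y) \triangleq \mathbb{P}(Y = y)\,\mathbb{E}\big((l(y) - S)^+ \big| Y = y\big),$$

$$J_\lambda(y) \triangleq b(y) + \lambda a(y).$$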

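We extend the definition of $J_\lambda(\cdot)$ to subtrees of $\mathcal{T}$ by setting

$$J_\lambda(\mathcal{T}_y) \triangleq \sum_{\gamma \in \mathcal{L}(\mathcal{T}_y)} J_\lambda(\gamma).$$

With this definition^5

$$J_\lambda(\mathcal{T}_y) = \begin{cases} J_\lambda(y) & \text{if } y \in \mathcal{L}(\mathcal{T}), \\ \sum_{\gamma \in \mathcal{Y}} J_\lambda(\mathcal{T}_{y\gamma}) & \text{if } y \in \mathcal{I}(\mathcal{T}). \end{cases}$$

Similarly, we define

$$a(\mathcal{T}_y) \triangleq \sum_{\gamma \in \mathcal{L}(\mathcal{T}_y)} a(\gamma),$$

$$b(\mathcal{T}_y) \triangleq \sum_{\gamma \in \mathcal{L}(\mathcal{T}_y)} b(\gamma).$$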

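For a given $\lambda \ge 0$ and $\mathcal{T}$, define $\mathcal{T}(\lambda) \subset \mathcal{T}$ to be the subtree of $\mathcal{T}$ having the same root, and such that $J_\lambda(\mathcal{T}(\lambda)) \le J_\lambda(\mathcal{T}')$ for all subtrees (with same root) $\mathcal{T}' \subset \mathcal{T}$, and $\mathcal{T}(\lambda) \subset \mathcal{T}'$ for all subtrees (with same root) $\mathcal{T}' \subset \mathcal{T}$ satisfying $J_\lambda(\mathcal{T}(\lambda)) = J_\lambda(\mathcal{T}')$. In words, among all subtrees of $\mathcal{T}$ yielding a minimal cost for a given $\lambda$, the tree $\mathcal{T}(\lambda)$ is the smallest. As we shall see in Lemma 2, such a smallest subtree always exists, and hence $\mathcal{T}(\lambda)$ is well defined.

Remark. Note that $\mathcal{T}_y(\lambda)$ is different from $(\mathcal{T}(\lambda))_y$. Indeed, $\mathcal{T}_y(\lambda)$ refers to the optimal subtree of $\mathcal{T}_y$ with respect to $\lambda$, whereas $(\mathcal{T}(\lambda))_y$ refers to the subtree rooted in $y$ of the optimal tree $\mathcal{T}(\lambda)$.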

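Example 6. Consider again the tree $\mathcal{T}$ in Figure 3. Assume $J_\lambda(\rho) = 4$, $J_\lambda(0) = 2$, $J_\lambda(1) = J_\lambda(00) = J_\lambda(01) = 1$. Then

$$J_\lambda(\mathcal{T}) = J_\lambda(1) + J_\lambda(00) + J_\lambda(01) = 3,$$

$$J_\lambda(\mathcal{T}_0) = J_\lambda(00) + J_\lambda(01) = 2.$$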



5 We used $T$, $\mathcal{T}$, $\mathcal{T}_y$, and $y$ as possible arguments of $J_\lambda(\cdot)$. No confusion should arise from this slight abuse of notation, since for non-randomized s.t.'s all of these arguments can be interpreted as trees.

The smallest optimal subtree of $\mathcal{T}$ having the same root is $\mathcal{T}(\lambda) = \{\rho, 0, 1\}$ and

$$J_\lambda(\mathcal{T}(\lambda)) = J_\lambda(0) + J_\lambda(1) = 3.$$

The smallest optimal subtree of $\mathcal{T}_0$ having the same root is $\mathcal{T}_0(\lambda) = \{0\}$ and

$$J_\lambda(\mathcal{T}_0(\lambda)) = J_\lambda(0) = 2.$$



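Given a $|\mathcal{Y}|$-ary tree $\mathcal{T}$ and $\lambda \ge 0$, the following lemma shows that $\mathcal{T}(\lambda)$ always exists and characterizes $\mathcal{T}(\lambda)$ and $J_\lambda(\mathcal{T}(\lambda))$. The reader may recognize the finite-horizon backward induction algorithm whose detailed proof can be found in standard textbooks (e.g., [19, Chapter 3 and 4]).

Lemma 2. Given a $|\mathcal{Y}|$-ary tree $\mathcal{T}$ and $\lambda \ge 0$. For every $y \in \mathcal{I}(\mathcal{T})$,

$$J_\lambda(\mathcal{T}_y(\lambda)) = \min\Big\{ J_\lambda(y),\ \sum_{\gamma \in \mathcal{Y}} J_\lambda(\mathcal{T}_{y\gamma}(\lambda)) \Big\},$$

and

$$\mathcal{T}_y(\lambda) = \begin{cases} \{y\} & \text{if } J_\lambda(y) \le \sum_{\gamma \in \mathcal{Y}} J_\lambda(\mathcal{T}_{y\gamma}(\lambda)), \\ \{y\} \cup \bigcup_{\gamma \in \mathcal{Y}} \mathcal{T}_{y\gamma}(\lambda) & \text{else.} \end{cases}$$

The optimal tree $\mathcal{T}(\lambda)$ and the corresponding cost $J_\lambda(\mathcal{T}(\lambda))$ are given by $\mathcal{T}_y(\lambda)$ and $J_\lambda(\mathcal{T}_y(\lambda))$ evaluated at $y = \rho$.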

Proof. By induction on the depth of the tree starting from the root. □ 
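The recursion of Lemma 2 can be sketched in a few lines of Python. The sketch assumes the per-node costs $J_\lambda(y)$ are already available in a dictionary; the encoding of nodes as strings (with the empty string as root $\rho$) and the function name `prune` are assumptions made for illustration. The costs are those of Example 6.

```python
def prune(tree, cost, y="", alphabet="01"):
    """Backward induction of Lemma 2.

    Returns (nodes of T_y(lambda), J_lambda(T_y(lambda))), where cost[z]
    holds J_lambda(z) for every node z of the tree.
    """
    children = [y + g for g in alphabet if y + g in tree]
    if not children:                       # y is a leaf of the tree
        return {y}, cost[y]
    nodes, total = {y}, 0
    for child in children:
        sub_nodes, sub_cost = prune(tree, cost, child, alphabet)
        nodes |= sub_nodes
        total += sub_cost
    if cost[y] <= total:                   # ties favor the smaller tree
        return {y}, cost[y]
    return nodes, total

# Tree of Figure 3 with the costs assumed in Example 6.
tree = {"", "0", "1", "00", "01"}
cost = {"": 4, "0": 2, "1": 1, "00": 1, "01": 1}

print(prune(tree, cost))        # optimal subtree with the same root
print(prune(tree, cost, "0"))   # optimal subtree of T_0
```

Breaking ties in favor of stopping at $y$ is what makes the returned subtree the smallest among the cost minimizers.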

From the structure of the cost function $J_\lambda(\cdot)$, the larger the value of $\lambda$, the higher the penalty on the error probability. Therefore one expects that the larger the $\lambda$ the "later" the optimal tree $\mathcal{T}(\lambda)$ will stop. Indeed, Lemma 3 states that the tree corresponding to the optimal s.t. of a smaller $\lambda$ is a subtree of the tree corresponding to the optimal s.t. of a larger $\lambda$. In other words, if $\lambda \le \tilde{\lambda}$, in order to find $\mathcal{T}(\lambda)$, we can restrict our search to subtrees of $\mathcal{T}(\tilde{\lambda})$.

Lemma 3. Given a tree $\mathcal{T}$, if $\lambda \le \tilde{\lambda}$ then $\mathcal{T}(\lambda) \subset \mathcal{T}(\tilde{\lambda})$.

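Proof. We have

$$a(\mathcal{T}_y) = \sum_{y\gamma \in \mathcal{L}(\mathcal{T}_y)} \mathbb{P}\big(S > l(y\gamma) \big| Y^{l(y\gamma)} = y\gamma\big)\,\mathbb{P}\big(Y^{l(y\gamma)} = y\gamma\big)$$

$$\le \sum_{y\gamma \in \mathcal{L}(\mathcal{T}_y)} \mathbb{P}\big(S > l(y) \big| Y^{l(y\gamma)} = y\gamma\big)\,\mathbb{P}\big(Y^{l(y\gamma)} = y\gamma\big)$$

$$= a(y). \qquad (6)$$

Similarly one shows that $b(\mathcal{T}_y) \ge b(y)$.

By contradiction, assume $\lambda \le \tilde{\lambda}$, but $\mathcal{T}(\lambda)$ is not a subset of $\mathcal{T}(\tilde{\lambda})$. Then there exists $y \in \mathcal{L}(\mathcal{T}(\tilde{\lambda}))$ such that $y \in \mathcal{I}(\mathcal{T}(\lambda))$. By definition of $\mathcal{T}(\tilde{\lambda})$ and Lemma 2

$$J_{\tilde{\lambda}}(y) \le J_{\tilde{\lambda}}(\mathcal{T}_y(\lambda)),$$

and thus

$$b(y) + \tilde{\lambda} a(y) \le b(\mathcal{T}_y(\lambda)) + \tilde{\lambda} a(\mathcal{T}_y(\lambda)). \qquad (7)$$

Now, since $a(\mathcal{T}_y(\lambda)) \le a(y)$, and $\lambda \le \tilde{\lambda}$,

$$(\lambda - \tilde{\lambda})\, a(y) \le (\lambda - \tilde{\lambda})\, a(\mathcal{T}_y(\lambda)). \qquad (8)$$

Combining (7) and (8) yields

$$b(y) + \lambda a(y) \le b(\mathcal{T}_y(\lambda)) + \lambda a(\mathcal{T}_y(\lambda)),$$

and therefore

$$J_\lambda(y) \le J_\lambda(\mathcal{T}_y(\lambda)).$$

Since $y \in \mathcal{I}(\mathcal{T}(\lambda))$, this contradicts the definition of $\mathcal{T}(\lambda)$ by Lemma 2. □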

The next theorem represents a key result. Given a tree $\mathcal{T}$, it characterizes the smallest value $\lambda$ can take for which $\mathcal{T}(\lambda) = \mathcal{T}$. For a non-trivial tree $\mathcal{T}$, define for any $y \in \mathcal{I}(\mathcal{T})$

$$g(y, \mathcal{T}) \triangleq \frac{b(\mathcal{T}_y) - b(y)}{a(y) - a(\mathcal{T}_y)}$$

where we set $0/0 = 0$. The quantity $g(y, \mathcal{T})$ captures the tradeoff between the reduction in delay $b(\mathcal{T}_y) - b(y)$ and the increase in probability of false-alarm $a(y) - a(\mathcal{T}_y)$ if we stop at some intermediate node $y$ instead of stopping at the leaves $\mathcal{L}(\mathcal{T}_y)$ of $\mathcal{T}$.

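Theorem 4. For any non-trivial tree $\mathcal{T}$

$$\inf\{\lambda \ge 0 : \mathcal{T}(\lambda) = \mathcal{T}\} = \max_{y \in \mathcal{I}(\mathcal{T})} g(y, \mathcal{T}).$$

Proof. Let $\mathcal{T}$ be a non-trivial tree and $y \in \mathcal{I}(\mathcal{T})$. We have

$$g(y, \mathcal{T}) = \frac{J_\lambda(\mathcal{T}_y) - \lambda a(\mathcal{T}_y) - J_\lambda(y) + \lambda a(y)}{a(y) - a(\mathcal{T}_y)} = \frac{J_\lambda(\mathcal{T}_y) - J_\lambda(y)}{a(y) - a(\mathcal{T}_y)} + \lambda. \qquad (9)$$

By (6), $a(\mathcal{T}_y) \le a(y)$, and hence the following implications hold:

$$g(y, \mathcal{T}) \le \lambda \iff J_\lambda(y) \ge J_\lambda(\mathcal{T}_y),$$

$$g(y, \mathcal{T}) < \lambda \iff J_\lambda(y) > J_\lambda(\mathcal{T}_y).$$

Therefore, if $\max_{y \in \mathcal{I}(\mathcal{T})} g(y, \mathcal{T}) < \lambda$ then

$$J_\lambda(y) > J_\lambda(\mathcal{T}_y) \qquad (10)$$

for all $y \in \mathcal{I}(\mathcal{T})$.

We first show by induction that if

$$\max_{y \in \mathcal{I}(\mathcal{T})} g(y, \mathcal{T}) < \lambda$$

then $\mathcal{T}(\lambda) = \mathcal{T}$. Consider a subtree of $\mathcal{T}$ having depth one and rooted in $y$, say. Since by (10) $J_\lambda(y) > J_\lambda(\mathcal{T}_y)$, we have $\mathcal{T}_y(\lambda) = \mathcal{T}_y$ by Lemma 2. Now consider a subtree of $\mathcal{T}$ with depth $k$, rooted in a different $y$, and assume the assertion to be true for all subtrees of $\mathcal{T}$ with depth up to $k - 1$. In order to find $\mathcal{T}_y(\lambda)$, we use Lemma 2 and compare $J_\lambda(y)$ with $\sum_{\gamma \in \mathcal{Y}} J_\lambda(\mathcal{T}_{y\gamma}(\lambda))$. Since $\mathcal{T}_{y\gamma}$ is a subtree of $\mathcal{T}$ with depth less than $k$, we have $\mathcal{T}_{y\gamma}(\lambda) = \mathcal{T}_{y\gamma}$ by the induction hypothesis. Therefore

$$\sum_{\gamma \in \mathcal{Y}} J_\lambda(\mathcal{T}_{y\gamma}(\lambda)) = \sum_{\gamma \in \mathcal{Y}} J_\lambda(\mathcal{T}_{y\gamma}) = J_\lambda(\mathcal{T}_y),$$

and since $J_\lambda(\mathcal{T}_y) < J_\lambda(y)$ by (10), we have $\mathcal{T}_y(\lambda) = \mathcal{T}_y$ by Lemma 2, which concludes the induction step. Hence we proved that if $\max_{y \in \mathcal{I}(\mathcal{T})} g(y, \mathcal{T}) < \lambda$, then $\mathcal{T}(\lambda) = \mathcal{T}$.

Second, suppose

$$\max_{y \in \mathcal{I}(\mathcal{T})} g(y, \mathcal{T}) = \lambda.$$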
Fig. 4. For all $m \in \{0, 1, \ldots, M - 1\}$ the tree $\mathcal{T}^m$ is the smallest tree minimizing the cost $J_\lambda(\cdot)$ for any $\lambda \in (\lambda_{m+1}, \lambda_m]$.



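In this case there exists $y \in \mathcal{I}(\mathcal{T})$ such that $J_\lambda(\mathcal{T}_y) = J_\lambda(y)$. We consider the cases when $\mathcal{T}_{y\gamma}(\lambda)$ and $\mathcal{T}_{y\gamma}$ are the same for all $\gamma \in \mathcal{Y}$ and when they differ for at least one $\gamma \in \mathcal{Y}$. If $\mathcal{T}_{y\gamma}(\lambda) = \mathcal{T}_{y\gamma}$ for all $\gamma \in \mathcal{Y}$ then

$$\sum_{\gamma \in \mathcal{Y}} J_\lambda(\mathcal{T}_{y\gamma}(\lambda)) = J_\lambda(\mathcal{T}_y) = J_\lambda(y),$$

and thus $\mathcal{T}(\lambda) \ne \mathcal{T}$ by Lemma 2. If $\mathcal{T}_{y\gamma}(\lambda) \ne \mathcal{T}_{y\gamma}$ for at least one $\gamma \in \mathcal{Y}$ then $\mathcal{T}(\lambda) \ne \mathcal{T}$ again by Lemma 2.

Finally, when

$$\max_{y \in \mathcal{I}(\mathcal{T})} g(y, \mathcal{T}) > \lambda$$

then $\mathcal{T}(\lambda) \ne \mathcal{T}$ follows from the previous case and Lemma 3. □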

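Let $\mathcal{T}^0$ denote the complete tree of depth $\kappa$. Starting with $\lambda_0 = \infty$, for $m \in \{1, \ldots, M\}$ recursively define

$$\lambda_m \triangleq \inf\{\lambda \le \lambda_{m-1} : \mathcal{T}^{m-1}(\lambda) = \mathcal{T}^{m-1}\},$$

$$\mathcal{T}^m \triangleq \mathcal{T}^{m-1}(\lambda_m),$$

where $M$ is the smallest integer such that $\lambda_{M+1} = 0$, and with $\lambda_1 = \infty$ if the set over which the infimum is taken is empty. Lemma 3 implies that for two consecutive transition points $\lambda_m$ and $\lambda_{m+1}$, we have $\mathcal{T}^0(\lambda) = \mathcal{T}^0(\lambda_m)$ for all $\lambda \in (\lambda_{m+1}, \lambda_m]$ as shown in Figure 4.

The following corollary is a consequence of Lemma 3 and Theorem 4.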

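Corollary 5. For $m \in \{1, \ldots, M\}$

$$\lambda_m = \max_{y \in \mathcal{I}(\mathcal{T}^{m-1})} g(y, \mathcal{T}^{m-1}), \qquad (11)$$

$$\mathcal{T}^m = \mathcal{T}^{m-1} \setminus \bigcup_{\substack{y \in \mathcal{I}(\mathcal{T}^{m-1}): \\ g(y, \mathcal{T}^{m-1}) = \lambda_m}} \mathcal{D}(\mathcal{T}^{m-1}_y). \qquad (12)$$

Moreover, the set $\{(\alpha_m, d_m)\}_{m=1}^{M}$ with

$$\alpha_m \triangleq \mathbb{P}(T(\mathcal{T}^m) < S),$$

$$d_m \triangleq \mathbb{E}(T(\mathcal{T}^m) - S)^+,$$

are the break-points of $d(\alpha)$.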

Proof. Let $\mathcal{T}^{m-1}$ be fixed. Equation (11) follows directly from Theorem 4. For (12), notice that as $J_\lambda(\mathcal{T})$ is continuous in $\lambda$, the definition of $\lambda_m$ yields $J_{\lambda_m}(\mathcal{T}^{m-1}) = J_{\lambda_m}(\mathcal{T}^m)$. Hence $\mathcal{T}^m$ is the smallest subtree of $\mathcal{T}^{m-1}$ with same root, and having a cost equal to $J_{\lambda_m}(\mathcal{T}^{m-1})$. From (9) and Lemma 2, we deduce that $\mathcal{T}^m$ is obtained from $\mathcal{T}^{m-1}$ by removing the descendants of any $y \in \mathcal{I}(\mathcal{T}^{m-1})$ such that $g(y, \mathcal{T}^{m-1}) = \lambda_m$.

It remains to show that $\{(\alpha_m, d_m)\}_{m=1}^{M}$ are the break-points of $d(\alpha)$. By Lemma 1, the break-points are achieved by non-randomized s.t.'s. By Lemma 3 we have $\mathcal{T}^m = \mathcal{T}^0(\lambda_m)$, i.e., $\mathcal{T}^m$ is the smallest subtree of $\mathcal{T}^0$ having the same root and minimizing the cost $J_{\lambda_m}(\mathcal{T})$. Hence, among the minimizers of $J_{\lambda_m}(\mathcal{T})$, $\mathcal{T}^m$ yields the largest $\mathbb{P}(T(\mathcal{T}) < S)$. Therefore each pair $(\alpha_m, d_m)$ is a break-point. Conversely, given a break-point of $d(\alpha)$, let $\mathcal{T}$ be the smallest subtree of $\mathcal{T}^0$ achieving it. Then $\mathcal{T} = \mathcal{T}^0(\lambda)$ for some $\lambda$. Since $\mathcal{T}^0(\lambda_m) = \mathcal{T}^m$ we have that $\{\mathcal{T}^0(\lambda)\}_{\lambda \ge 0} = \{\mathcal{T}^m\}_{m=0}^{M}$, and therefore $\mathcal{T} = \mathcal{T}^m$ for some $m \in \{1, \ldots, M\}$. □

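From Corollary 5 we deduce the algorithm below that fully characterizes $d(\alpha)$ by computing its set of break-points $\{(\alpha_m, d_m)\}_{m=1}^{M}$.

Algorithm: Compute the break-points $\{(\alpha_m, d_m)\}_{m=1}^{M}$ of $d(\alpha)$

    $m \Leftarrow 0$
    $\lambda_0 \Leftarrow \infty$
    $\mathcal{T}^0 \Leftarrow$ complete tree of depth $\kappa$
    repeat
        $m \Leftarrow m + 1$
        $\lambda_m \Leftarrow \max_{y \in \mathcal{I}(\mathcal{T}^{m-1})} g(y, \mathcal{T}^{m-1})$
        $\mathcal{T}^m \Leftarrow \mathcal{T}^{m-1} \setminus \bigcup_{y \in \mathcal{I}(\mathcal{T}^{m-1}):\ g(y, \mathcal{T}^{m-1}) = \lambda_m} \mathcal{D}(\mathcal{T}^{m-1}_y)$
        $\alpha_m \Leftarrow \mathbb{P}(T(\mathcal{T}^m) < S)$
        $d_m \Leftarrow \mathbb{E}(T(\mathcal{T}^m) - S)^+$
    until $\lambda_m = 0$
    $M \Leftarrow m - 1$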
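The listing above can be turned into a short Python sketch for a concrete model. Everything specific here is an assumption chosen for illustration: the $X_i$ are i.i.d. Bernoulli(1/2), each $Y_i$ is $X_i$ observed through a binary symmetric channel with crossover probability `p`, and $S = 1$ if $X_1 = 1$ and $S = \kappa$ otherwise. Nodes are tuples over $\{0, 1\}$, the root is the empty tuple, and exact rational arithmetic (`fractions.Fraction`) keeps the ties $g(y, \mathcal{T}^{m-1}) = \lambda_m$ exact. The function names are hypothetical, and the sketch makes no attempt at efficiency.

```python
import math
from fractions import Fraction
from itertools import product

def joint(y, x1, p):
    """P(Y^l = y, X_1 = x1) under the assumed i.i.d. Bernoulli(1/2) / BSC(p) model."""
    if not y:
        return Fraction(1, 2)
    first = (1 - p) if y[0] == x1 else p
    return Fraction(1, 2) * first * Fraction(1, 2) ** (len(y) - 1)

def a_b(y, kappa, p):
    """a(y) = P(Y=y, S>l(y)) and b(y) = P(Y=y) E((l(y)-S)^+ | Y=y)."""
    a = b = Fraction(0)
    for x1 in (0, 1):
        s = 1 if x1 == 1 else kappa        # assumed rule: S = 1 if X_1 = 1, else kappa
        pr = joint(y, x1, p)
        a += pr * (s > len(y))
        b += pr * max(0, len(y) - s)
    return a, b

def tst_breakpoints(kappa, p):
    """Return [(lambda_m, alpha_m, d_m)] for m = 0, 1, ..., with lambda_0 = inf."""
    tree = {y for n in range(kappa + 1) for y in product((0, 1), repeat=n)}
    lam, result = math.inf, []
    while True:
        leaves = [y for y in tree if y + (0,) not in tree]

        def leaf_sums(y):
            # a(T_y) and b(T_y): sums of a and b over the leaves below y.
            pairs = [a_b(z, kappa, p) for z in leaves if z[:len(y)] == y]
            return sum(a for a, _ in pairs), sum(b for _, b in pairs)

        result.append((lam, *leaf_sums(())))
        g = {}
        for y in tree.difference(leaves):
            (a_sub, b_sub), (a_y, b_y) = leaf_sums(y), a_b(y, kappa, p)
            num, den = b_sub - b_y, a_y - a_sub
            g[y] = num / den if den else (math.inf if num else 0)  # 0/0 = 0
        lam = max(g.values(), default=0)
        if lam == 0:
            return result
        cut = [y for y in g if g[y] == lam]
        # Remove the descendants of every node attaining the maximum.
        tree = {z for z in tree
                if not any(len(z) > len(y) and z[:len(y)] == y for y in cut)}

for lam, alpha, d in tst_breakpoints(2, Fraction(1, 10)):
    print(lam, alpha, d)
```

Each printed triple gives a multiplier together with the false-alarm probability and delay of the tree it produces; consecutive points are joined by a segment whose slope is the negative of the next multiplier. Passing `p` as a `Fraction` is what keeps the comparisons exact; with floats, ties may be missed.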



As a $|\mathcal{Y}|$-ary tree has less than $|\mathcal{Y}|^{\kappa}$ non-terminal nodes, the algorithm terminates after at most that many iterations. Further, one may check that each iteration has a running time that is $\exp(O(\kappa))$. Therefore, the worst case running time of the algorithm is $\exp(O(\kappa))$. This is to be compared, for instance, with exhaustive search that has a $\Omega(\exp\exp(\kappa))$ running time (because all break-points of $d(\alpha)$ are achieved by non-randomized s.t.'s and there are already $2^{|\mathcal{Y}|^{\kappa-1}}$ $|\mathcal{Y}|$-ary trees having leaves at either depth $\kappa$ or $\kappa - 1$).

In Sections IV and V we will see that, under certain conditions on $\{(X_i, Y_i)\}_{i \ge 1}$ and $S$, the running time of the algorithm is only polynomial in $\kappa$.



A. A Lower Bound on the Reaction Delay 

From Corollary 5 we may also deduce a lower bound on $d(\alpha)$. Since $d(\alpha)$ is convex, we can lower bound it as

$$d(\alpha) \ge d(0) + \alpha\, d'(0+) \qquad (13)$$

where $d'(0+)$ denotes the right derivative of $d$ at $\alpha = 0$. By Corollary 5, if $\lambda_1 < \infty$ then $d(0)$ is achieved by the complete tree $\mathcal{T}^0$, and if $\lambda_1 = \infty$ then $d(0)$ is achieved by $\mathcal{T}^1$ which is a strict subtree of $\mathcal{T}^0$. Hence (13) can be written as

$$d(\alpha) \ge d(0) - \begin{cases} \alpha\lambda_1 & \text{if } \lambda_1 < \infty, \\ \alpha\lambda_2 & \text{else.} \end{cases} \qquad (14)$$

Note that the above bound is tight for $\alpha \le \alpha_1$ with $\alpha_1 > 0$ when $\lambda_1 < \infty$, and is tight for $\alpha \le \alpha_2$ with $\alpha_2 > 0$ when $\lambda_1 = \infty$. The following example illustrates this bound.
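Evaluating the bound (14) only requires $d(0)$ and the first finite multiplier. The short Python sketch below encodes the case distinction; the function name and the numerical inputs are hypothetical values chosen for illustration.

```python
import math

def delay_lower_bound(alpha, d0, lam1, lam2):
    """Right-hand side of (14): d(0) - alpha * lambda_1, or with lambda_2 if lambda_1 is infinite."""
    slope = lam1 if math.isfinite(lam1) else lam2
    return d0 - alpha * slope

# Hypothetical inputs: first with a finite lambda_1, then with lambda_1 = infinity.
print(delay_lower_bound(0.01, 0.5, 9.0, 1 / 9))
print(delay_lower_bound(0.01, 0.5, math.inf, 2.0))
```

In the second call $d(0)$ should be read as the delay of $\mathcal{T}^1$, since that is the tree achieving $d(0)$ when $\lambda_1 = \infty$.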

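Example 7. Let $\{X_i\}_{i \ge 1}$ be i.i.d. Bernoulli(1/2), and let the $Y_i$'s be the output of a binary symmetric channel with crossover probability $p \in (0, 1/2)$ for input $X_i$. Consider the s.t. $S$ defined as

$$S \triangleq \begin{cases} 1 & \text{if } X_1 = 1, \\ \kappa & \text{else.} \end{cases}$$

For $\kappa = 2$, the tree corresponding to this s.t. is depicted in Figure 3.

Since $p \in (0, 1/2)$, it is clear that whenever $\mathcal{T}$ is not the complete tree of depth $\kappa$, we have $\mathbb{P}(T(\mathcal{T}) < S) > 0$, hence

$$d(0) = \mathbb{E}(T(\mathcal{T}^0) - S)^+ = \frac{1}{2}(\kappa - 1).$$

An easy computation using Corollary 5 yields

$$\lambda_1 = (\kappa - 1)\frac{1 - p}{p}$$

and, using (14), we get

$$d(\alpha) \ge (\kappa - 1)\left(\frac{1}{2} - \alpha\,\frac{1 - p}{p}\right). \qquad (15)$$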



,2 p 

Let us comment on (15). Consider any two correlated sequences $\{X_i\}_{i \ge 1}$ and $\{Y_i\}_{i \ge 1}$ and a s.t. $S$ with respect to the $X_i$'s. Intuition tells us that there are two factors affecting $d(\alpha)$. The first is the correlation between the $X_i$'s and $Y_i$'s, in the above example parameterized by $p$. The lower the correlation, the higher $d(\alpha)$ will be. The second factor is the "variability" of $S$, and might be characterized by the difference in terms of depth among the leaves having large probability to be reached. In the above example the "variability" might be captured by $\kappa$, since with probability 1/2 a leaf of depth 1 is reached, and with probability 1/2 a leaf of depth $\kappa$ is attained.

Example 8. We consider one-bit message feedback communication when the forward and the feedback channels are binary erasure channels with erasure probabilities $\varepsilon$ and $p$, respectively. We refer the reader to Example 2 in Section I for the general problem setting. We use the following transmission scheme (which is optimal in the case of noiseless feedback). The decoder keeps sending 0 over the feedback channel until time $S$, the first time a non-erasure occurs or $\kappa$ time units have elapsed. From that point on, the decoder sends 1. The encoder keeps sending the message bit it wants to deliver until time $T$ (a stopping time with respect to the output of the feedback channel). Ideally, we would like to choose $T = S$. This is possible if the feedback is noiseless, i.e., $p = 0$. If $p > 0$, we want to track the decoding time $S$ as closely as possible. The constant $\kappa$ plays here the role of a "time-out." In the following, we assume that $\varepsilon, p \in (0, 1)$.

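Let us focus on $d(\alpha)$. One can show that $\lambda_1 = \infty$ and therefore the bound (14) becomes $d(\alpha) \ge d(0) - \alpha\lambda_2$, where $\lambda_2 = \max_{y \in \mathcal{I}(\mathcal{T}^1)} g(y, \mathcal{T}^1)$ from Corollary 5. A somewhat involved computation yields

$$d(\alpha) \ge \left(\frac{p}{1 - p} - \varepsilon^{1 - \kappa}\alpha\right)(1 + o(1)) \qquad (16)$$

as $\kappa \to \infty$.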

The delay $d(\alpha)$ is interpreted as the time it takes the encoder to realize that the decoder has made a decision. Equation (16) relates this delay to the channel parameters $\varepsilon$ and $p$, the probability $\alpha$ of stopping retransmission too early, and the value of the "time-out" $\kappa$. For the communication scheme considered here, there are two events leading to decoding errors. The event $\{X_\kappa = 0\}$, indicating that only erasures were received by the decoder until time $\kappa$, and the event $\{T < S\}$, indicating that the encoder stopped retransmission before the decoder received a non-erasure. In both cases the decoder will make an error with probability 1/2. Hence the overall probability of error $\mathbb{P}(\mathcal{E})$ can be bounded as

$$\max\{\alpha, \varepsilon^{\kappa}\} \le 2\,\mathbb{P}(\mathcal{E}) \le \alpha + \varepsilon^{\kappa}.$$

It is then reasonable to choose $\kappa = \frac{\log \alpha}{\log \varepsilon}$, i.e., to scale $\kappa$ with $\alpha$ so that both sources of errors have the same weight. This results in a delay of

$$d(\alpha) \ge \left(\frac{p}{1 - p} - \varepsilon\right)(1 + o(1))$$

as $\alpha \to 0$.
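The choice of the time-out can be checked numerically. The Python sketch below picks $\kappa$ from $\alpha$ and $\varepsilon$ and evaluates the two-sided bound on the error probability stated above. Rounding $\kappa$ up to an integer is an assumption of this sketch (the text treats $\kappa = \log\alpha / \log\varepsilon$ without rounding), and the function names are hypothetical.

```python
import math

def timeout_for(alpha, eps):
    """Integer time-out kappa = ceil(log(alpha) / log(eps)), for 0 < alpha, eps < 1."""
    return math.ceil(math.log(alpha) / math.log(eps))

def error_bounds(alpha, eps, kappa):
    """Bounds on P(E) from max{alpha, eps^kappa} <= 2 P(E) <= alpha + eps^kappa."""
    tail = eps ** kappa                    # probability of kappa erasures in a row
    return max(alpha, tail) / 2, (alpha + tail) / 2

alpha, eps = 0.01, 0.5
kappa = timeout_for(alpha, eps)
print(kappa, error_bounds(alpha, eps, kappa))
```

With this scaling the time-out term $\varepsilon^{\kappa}$ is of the same order as $\alpha$, so the lower and upper bounds differ by at most a factor of two.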
Now suppose that the communication rate $R$ is computed with respect to the delay from the time communication starts until the time the decoder has made a decision and the encoder has realized this, i.e., $\mathbb{E}S + \mathbb{E}(T - S)^+ = \mathbb{E}(\max\{S, T\})$. We conclude that the "send until a non-erasure" strategy asymptotically achieves a rate that is upper bounded as

$$R \le \frac{1}{\frac{1}{1 - \varepsilon} + \frac{p}{1 - p} - \varepsilon}.$$

When $\varepsilon < p/(1 - p)$, our bound is strictly below the capacity of the binary erasure channel $1 - \varepsilon$. Hence $\varepsilon/(1 + \varepsilon)$ represents a critical value for the erasure probability $p$ of the feedback channel above which the "send until non-erasure" strategy is strictly suboptimal. Indeed there exist block coding strategies, making no use of feedback, that (asymptotically) achieve rates up to $1 - \varepsilon$, the capacity of the forward channel.



IV. Permutation Invariant Stopping Times 

We consider a special class of s.t.'s and processes $\{(X_i,Y_i)\}_{i\ge 1}$ for which the optimal tradeoff curve
$d(\alpha)$ and the associated optimal s.t.'s can be computed in polynomial time in $\kappa$.
A s.t. $S$ with respect to $\{X_i\}_{i\ge 1}$ is permutation invariant if

$$\mathbb{P}(S \le n \mid X^n = x^n) = \mathbb{P}(S \le n \mid X^n = \pi(x^n))$$

for all permutations $\pi : \mathcal{X}^n \to \mathcal{X}^n$, all $x^n \in \mathcal{X}^n$ and $n \in \{1,\ldots,\kappa\}$. Examples of permutation invariant
s.t.'s are $\inf\{i : X_i > c\}$ or $\inf\{i : \sum_{j=1}^{i} X_j > c\}$ for some constant $c$ and assuming the $X_i$'s are positive.
The notion of a permutation invariant s.t. is closely related to (and in fact slightly stronger than) that of
an exchangeable s.t. as defined in [20].
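
As a quick illustration of the definition, the following minimal Python sketch checks permutation invariance by brute force over all short strings. The function names, the small alphabet, and the "first ascent" rule used as a counterexample are assumptions made for this illustration and do not come from the paper; the truncation of $S$ at $\kappa$ is ignored.

```python
from itertools import permutations, product

def stop_prob_threshold(x, c=1):
    """P(S <= n | X^n = x) for S = inf{i : X_i > c}; equals 1 iff some symbol exceeds c."""
    return 1.0 if any(xi > c for xi in x) else 0.0

def stop_prob_first_ascent(x):
    """P(S <= n | X^n = x) for the hypothetical rule S = inf{i >= 2 : X_i > X_{i-1}}."""
    return 1.0 if any(b > a for a, b in zip(x, x[1:])) else 0.0

def is_permutation_invariant(stop_prob, alphabet, kappa):
    """Check stop_prob(x) == stop_prob(pi(x)) for all x of length 1..kappa and all pi."""
    for n in range(1, kappa + 1):
        for x in product(alphabet, repeat=n):
            reference = stop_prob(x)
            if any(stop_prob(px) != reference for px in permutations(x)):
                return False
    return True

if __name__ == "__main__":
    print(is_permutation_invariant(stop_prob_threshold, (0, 1, 2), 4))
    print(is_permutation_invariant(stop_prob_first_ascent, (0, 1, 2), 4))
```

The threshold rule passes the check, while the first-ascent rule fails it, since the string (0, 1) contains an ascent and its permutation (1, 0) does not.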

The following theorem establishes a key result, from which the running time of one iteration of the 
algorithm can be deduced. 

Theorem 6. Let $\{(X_i,Y_i)\}_{i\ge 1}$ be i.i.d. and $S$ be a permutation invariant s.t. with respect to $\{X_i\}_{i\ge 1}$. If
$T(\mathcal{T})$ is non-randomized and permutation invariant then

$$g(y,\mathcal{T}) = g(\pi(y),\mathcal{T})$$

for all $y \in \mathcal{I}(\mathcal{T})$ and all permutations $\pi$.

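We first establish two lemmas that will be used in the proof of Theorem 6.

Lemma 7. Let $T$ be a non-randomized s.t. with respect to $\{Y_i\}_{i\ge 1}$ and $\mathcal{T}$ the corresponding tree. Then
$T$ is permutation invariant if and only if for all $y \in \mathcal{I}(\mathcal{T})$ and permutations $\pi$, $\pi(y) \in \mathcal{I}(\mathcal{T})$.

Proof. Assume $T$ is permutation invariant and let $y^n \in \mathcal{I}(\mathcal{T})$. Then

$$0 = \mathbb{P}(T \le n \mid Y^n = y^n) = \mathbb{P}(T \le n \mid Y^n = \pi(y^n)),$$

and hence $\pi(y^n) \in \mathcal{I}(\mathcal{T})$.

Conversely assume that, for all $y \in \mathcal{I}(\mathcal{T})$ and permutations $\pi$, we have $\pi(y) \in \mathcal{I}(\mathcal{T})$. Pick an arbitrary
$y^n$. First, if $\mathbb{P}(T \le n \mid Y^n = y^n) = 0$, then $y^n \in \mathcal{I}(\mathcal{T})$, and by assumption also $\pi(y^n) \in \mathcal{I}(\mathcal{T})$. Thus
$\mathbb{P}(T \le n \mid Y^n = \pi(y^n)) = 0$. Second, if $\mathbb{P}(T \le n \mid Y^n = y^n) = 1$, then $y^n \notin \mathcal{I}(\mathcal{T})$, and by assumption also
$\pi(y^n) \notin \mathcal{I}(\mathcal{T})$. Thus $\mathbb{P}(T \le n \mid Y^n = \pi(y^n)) = 1$. □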

Lemma 8. Let $\{(X_i,Y_i)\}_{i\ge 1}$ be i.i.d. and $S$ be a permutation invariant s.t. with respect to $\{X_i\}_{i\ge 1}$. Then
$S$ is a permutation invariant s.t. with respect to $\{Y_i\}_{i\ge 1}$.

Proof. Using that the $\{(X_i,Y_i)\}_{i\ge 1}$ are i.i.d., one can easily check that $S$ is a s.t. with respect to $\{Y_i\}_{i\ge 1}$.
It remains to show that it is permutation invariant. For any permutation $\pi : \mathcal{X}^n \to \mathcal{X}^n$

$$\begin{aligned}
\mathbb{P}(S \le n \mid Y^n = y^n)
&= \sum_{x^n \in \mathcal{X}^n} \mathbb{P}(S \le n \mid X^n = x^n)\,\mathbb{P}(X^n = x^n \mid Y^n = y^n) \\
&= \sum_{x^n \in \mathcal{X}^n} \mathbb{P}(S \le n \mid X^n = \pi^{-1}(x^n))\,\mathbb{P}(X^n = \pi^{-1}(x^n) \mid Y^n = y^n) \\
&= \sum_{x^n \in \mathcal{X}^n} \mathbb{P}(S \le n \mid X^n = x^n)\,\mathbb{P}(X^n = x^n \mid Y^n = \pi(y^n)) \\
&= \mathbb{P}(S \le n \mid Y^n = \pi(y^n)),
\end{aligned}$$

where the second last equality follows by the permutation invariance of $S$ and the fact that the $(X_i,Y_i)$'s
are i.i.d. □
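
The marginalization used in this proof can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions: the joint pmf `JOINT` is a made-up example, the stopping rule is $S = \inf\{i : \sum_{j\le i} X_j \ge 2\}$ without truncation at $\kappa$, and all function names are hypothetical. Exact rational arithmetic is used so that equality tests are not affected by rounding.

```python
from fractions import Fraction
from itertools import permutations, product

# Hypothetical joint pmf P(X = x, Y = y) on {0,1} x {0,1}; the numbers are illustrative only.
JOINT = {(0, 0): Fraction(1, 2), (0, 1): Fraction(1, 10),
         (1, 0): Fraction(1, 10), (1, 1): Fraction(3, 10)}

def stop_prob_given_x(x):
    """P(S <= n | X^n = x) for S = inf{i : X_1 + ... + X_i >= 2} (permutation invariant)."""
    return 1 if sum(x) >= 2 else 0

def posterior_x_given_y(x, y):
    """P(X^n = x | Y^n = y) for i.i.d. pairs: a product of per-symbol posteriors."""
    w = Fraction(1)
    for xi, yi in zip(x, y):
        w *= JOINT[(xi, yi)] / (JOINT[(0, yi)] + JOINT[(1, yi)])
    return w

def stop_prob_given_y(y):
    """P(S <= n | Y^n = y), obtained by summing over all x^n as in the proof."""
    return sum(stop_prob_given_x(x) * posterior_x_given_y(x, y)
               for x in product((0, 1), repeat=len(y)))

def lemma8_holds(kappa):
    """True if P(S <= n | Y^n = y) is unchanged by every permutation of y, for n <= kappa."""
    return all(stop_prob_given_y(y) == stop_prob_given_y(py)
               for n in range(1, kappa + 1)
               for y in product((0, 1), repeat=n)
               for py in permutations(y))

if __name__ == "__main__":
    print(lemma8_holds(4))
```

The enumeration over all $x^n$ is exponential in $n$ and is meant only as a sanity check on small instances.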

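Proof of Theorem 6. We show that

$$g(y,\mathcal{T}) = \frac{b(\mathcal{T}_y) - b(y)}{a(y) - a(\mathcal{T}_y)} = g(\pi(y),\mathcal{T}) \tag{17}$$

for all $y \in \mathcal{I}(\mathcal{T})$. We prove that the numerator and the denominator in (17) remain unchanged if we
replace $y$ by $\pi(y)$. Fix some $y = y^n \in \mathcal{I}(\mathcal{T})$, and, to simplify notation, set $l = l(\gamma)$ until the end of this
proof. For the denominator, using Lemma 8 we obtain

$$\begin{aligned}
a(y) - a(\mathcal{T}_y)
&\triangleq a(y^n) - \sum_{y^n\gamma \in \mathcal{L}(\mathcal{T}_{y^n})} a(y^n\gamma) \\
&= \mathbb{P}(Y^n = y^n)\,\mathbb{P}(S > n \mid Y^n = y^n) - \sum_{y^n\gamma \in \mathcal{L}(\mathcal{T}_{y^n})} \mathbb{P}(Y^{n+l} = y^n\gamma)\,\mathbb{P}(S > n+l \mid Y^{n+l} = y^n\gamma) \\
&= \mathbb{P}(Y^n = \pi(y^n))\,\mathbb{P}(S > n \mid Y^n = \pi(y^n)) - \sum_{y^n\gamma \in \mathcal{L}(\mathcal{T}_{y^n})} \mathbb{P}(Y^{n+l} = \pi(y^n)\gamma)\,\mathbb{P}(S > n+l \mid Y^{n+l} = \pi(y^n)\gamma).
\end{aligned} \tag{18}$$

A consequence of Lemma 7 is that the set of all $\gamma$ such that $y^n\gamma \in \mathcal{L}(\mathcal{T}_{y^n})$ is identical to the set of all
$\gamma$ such that $\pi(y^n)\gamma \in \mathcal{L}(\mathcal{T}_{\pi(y^n)})$. Hence by (18)

$$a(y^n) - a(\mathcal{T}_{y^n}) = a(\pi(y^n)) - a(\mathcal{T}_{\pi(y^n)}).$$

For the numerator in (17), we have

$$b(\mathcal{T}_{y^n}) - b(y^n) = \sum_{y^n\gamma \in \mathcal{L}(\mathcal{T}_{y^n})} \mathbb{P}(Y^{n+l} = y^n\gamma)\Big(\mathbb{E}\big((n+l-S)^+ \mid Y^{n+l} = y^n\gamma\big) - \mathbb{E}\big((n-S)^+ \mid Y^{n+l} = y^n\gamma\big)\Big). \tag{19}$$

By Lemma 8

$$\begin{aligned}
\mathbb{E}\big((n+l-S)^+ \mid Y^{n+l} = y^n\gamma\big) - \mathbb{E}\big((n-S)^+ \mid Y^{n+l} = y^n\gamma\big)
&= \sum_{k=n}^{n+l-1} \mathbb{P}(S \le k \mid Y^{n+l} = y^n\gamma) \\
&= \sum_{k=n}^{n+l-1} \mathbb{P}(S \le k \mid Y^{n+l} = \pi(y^n)\gamma).
\end{aligned}$$

□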



We now show that one iteration of the algorithm has only polynomial running time in $\kappa$. Specifically,
we evaluate the running time to compute $\mathcal{T}^{m+1}$ from $\mathcal{T}^m$ if $S$ and $T(\mathcal{T}^m)$ are permutation invariant and
if the $(X_i,Y_i)$'s are i.i.d. To that aim, we assume the input of the algorithm to be in the form of a list of
the probabilities $\mathbb{P}(S \le n \mid X^n = x^n)$ for all $x^n \in \mathcal{X}^n$ and $n \in \{1,\ldots,\kappa\}$ (specifying $S$) and a list of
$\mathbb{P}(X = x, Y = y)$ for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ (characterizing the distribution of the process $\{(X_i,Y_i)\}_{i\ge 1}$).
Note that, as $S$ is permutation invariant, we only have to specify $\mathbb{P}(S \le n \mid X^n = x^n)$ for each composition
(or type) of $x^n$. Since the number of compositions of length at most $\kappa$ is upper bounded by $(\kappa+1)^{1+|\mathcal{X}|}$
(any element $x \in \mathcal{X}$ appears at most $\kappa$ times in a string of length $\kappa$), the list of these probabilities has
only polynomial size in $\kappa$. Using a hash table, we assume that, given $x^n$, the element $\mathbb{P}(S \le n \mid X^n = x^n)$
in the list can be accessed in $O(\kappa)$ time. The proof of the following theorem is deferred to the appendix.
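
The composition-indexed list can be pictured with a small Python sketch. This is an illustration only, not the paper's implementation: the names `composition`, `build_table` and `lookup` are hypothetical, and `build_table` enumerates all strings (which is exponential in $\kappa$) purely to show that the resulting table has one entry per composition.

```python
from collections import Counter
from itertools import product

def composition(x):
    """Composition (type) of a string: sorted (symbol, count) pairs, usable as a dict key."""
    return tuple(sorted(Counter(x).items()))

def build_table(stop_prob, alphabet, kappa):
    """Store one value of P(S <= n | X^n = x^n) per composition of length 1..kappa."""
    table = {}
    for n in range(1, kappa + 1):
        for x in product(alphabet, repeat=n):
            table.setdefault(composition(x), stop_prob(x))
    return table

def lookup(table, x):
    """Fetch P(S <= n | X^n = x); computing the composition takes time linear in len(x)."""
    return table[composition(x)]

if __name__ == "__main__":
    # Hypothetical rule: stop once at least two ones have been observed.
    table = build_table(lambda x: 1.0 if sum(x) >= 2 else 0.0, (0, 1), 3)
    print(len(table), (3 + 1) ** (1 + 2))
    print(lookup(table, (1, 0, 1)))
```

For a binary alphabet and $\kappa = 3$ there are $2 + 3 + 4 = 9$ compositions, well below the bound $(\kappa+1)^{1+|\mathcal{X}|} = 64$.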

Theorem 9. Let $\{(X_i,Y_i)\}_{i\ge 1}$ be i.i.d., let $S$ and $T(\mathcal{T}^m)$ be permutation invariant s.t.'s with respect to
$\{X_i\}_{i\ge 1}$ and $\{Y_i\}_{i\ge 1}$ respectively, and let $\alpha_m = \mathbb{P}(T(\mathcal{T}^m) < S)$ and $d_m = \mathbb{E}(T(\mathcal{T}^m) - S)^+$ be given.
Then $\mathcal{T}^{m+1}$, $\alpha_{m+1}$, and $d_{m+1}$ can be computed in polynomial time in $\kappa$.

As a corollary of Theorem 9, we obtain the worst case running time for computing the set of break-points
$\{(\alpha_m, d_m)\}_{m=0}^{M}$ together with the associated optimal s.t.'s $\{\mathcal{T}^m\}_{m=0}^{M}$.

Corollary 10. Let $\{(X_i,Y_i)\}_{i\ge 1}$ be i.i.d. and $S$ be a permutation invariant s.t. with respect to $\{X_i\}_{i\ge 1}$.
If all $\{\mathcal{T}^m\}_{m=0}^{M}$ are permutation invariant, then the algorithm has a polynomial running time in $\kappa$.

Proof. By Theorem 9 we only have to bound the number of iterations of the algorithm. To this end note that
by Theorem 6 every composition of $y$ can be only once a maximizer of $g(y, \mathcal{T}^m)$ (as the corresponding
nodes will be leaves in the next iteration of the algorithm). Hence, there are at most $O((\kappa+1)^{1+|\mathcal{Y}|})$
iterations. □

Note that, in the cases where $\{\mathcal{T}^m\}_{m=0}^{M}$ are not permutation invariant, one may still be able to derive
a lower bound on $d(\alpha)$ in polynomial time in $\kappa$, using (14). Indeed, the tree $\mathcal{T}^0$ is permutation invariant
since it is complete and, by Theorem 9, if $\{(X_i,Y_i)\}_{i\ge 1}$ are i.i.d. and $S$ is permutation invariant, then the
first subtree $\mathcal{T}^1$ can be computed in polynomial time in $\kappa$. Therefore the bound

$$d(\alpha) \ge d(0) - \alpha\lambda_1 \tag{20}$$

can always be evaluated in polynomial time in $\kappa$ when the $(X_i,Y_i)$'s are i.i.d. and $S$ is permutation
invariant. Note that this bound is in general weaker than the one derived in Section III-A. However, when
$\lambda_1 < \infty$ the bound (20) is tight for $\alpha \in [0, \alpha_1]$ for some $\alpha_1 > 0$. It is easily checked that the condition
$\lambda_1 < \infty$ is satisfied if $\mathbb{P}(S = \kappa, Y^{\kappa-1} = y^{\kappa-1}) > 0$ for all $y^{\kappa-1}$.

In the next section, we present two examples for which the conditions of Corollary 10 are satisfied, and
hence for which the algorithm has a polynomial running time in $\kappa$. First, we consider a TST problem that
indeed can be formulated as a Bayesian change-point problem. Second, we consider the case of a pure
TST problem, i.e., one that cannot be formulated as a Bayesian change-point problem. For both examples,
we provide an analytical solution of the Lagrange minimization problem $\min_{T \le \kappa} J_\lambda(T)$.

V. One-step Lookahead Stopping Times

In this section, we show that under certain conditions the s.t. that minimizes the Lagrangian $J_\lambda(T)$ can
be found in closed form.
Define

$$\mathcal{A}_n \triangleq \Big\{ y^n \in \mathcal{Y}^n : \sum_{\gamma \in \mathcal{Y}} J_\lambda(y^n\gamma) \ge J_\lambda(y^n) \Big\}$$

and let

$$T^*_\lambda = \min\{\kappa, \inf\{n : Y^n \in \mathcal{A}_n\}\}. \tag{21}$$

In words, $T^*_\lambda$ stops whenever the current cost

$$\mathbb{E}\big((n-S)^+ \mid Y^n = y^n\big) + \lambda\,\mathbb{P}(S > n \mid Y^n = y^n)$$

is less than the expected cost at time $n+1$, i.e.,

$$\mathbb{E}\big((n+1-S)^+ \mid Y^n = y^n\big) + \lambda\,\mathbb{P}(S > n+1 \mid Y^n = y^n).$$

Recall that $\mathcal{T}^0$ denotes the complete tree of depth $\kappa$. For $(X_i,Y_i)$'s i.i.d., Theorem 11 provides a sufficient
condition on $S$ for which $T(\mathcal{T}^0(\lambda)) = T^*_\lambda$. In words, the s.t. $T^*_\lambda$ minimizes $J_\lambda(T)$ among all s.t.'s bounded
by $\kappa$. Furthermore, among all stopping times minimizing $J_\lambda(T)$, the s.t. $T^*_\lambda$ admits the smallest tree
representation. The proof of Theorem 11 is deferred to the appendix.

Theorem 11. Let $\{(X_i,Y_i)\}_{i\ge 1}$ be i.i.d., and let $S$ be a s.t. with respect to $\{X_i\}_{i\ge 1}$ that satisfies

$$\mathbb{P}(S = n \mid Y^{n-1}) \ge \mathbb{P}(S = n+1 \mid Y^n) \tag{22}$$

for all $n \in \{2,\ldots,\kappa\}$. Then

$$T(\mathcal{T}^0(\lambda)) = T^*_\lambda.$$

Note that, unlike the algorithm, Theorem 11 provides an analytical solution only to the inner minimization
problem in ©. To find the reaction delay $d(\alpha)$ one still needs to maximize over the Lagrange
multipliers $\lambda$.

Using Corollary 10 and Theorem 11, we now give two examples of processes $\{(X_i,Y_i)\}_{i\ge 1}$ and s.t. $S$ for which
the algorithm has only polynomial running time in $\kappa$.

Example 9. Let $\{(X_i,Y_i)\}_{i\ge 1}$ be i.i.d. with the $X_i$'s taking values in $\{0,1\}$. Consider the s.t. $S = \inf\{i :
X_i = 1\}$. We have for $n \ge 2$

$$\begin{aligned}
\mathbb{P}(S = n \mid Y^{n-1}) &= \mathbb{P}(S \ge n \mid Y^{n-1})\,\mathbb{P}(X_n = 1) \\
&\ge \mathbb{P}(S \ge n \mid Y^{n-1})\,\mathbb{P}(X_n = 0 \mid Y_n)\,\mathbb{P}(X_{n+1} = 1) \\
&= \mathbb{P}(S = n+1 \mid Y^n).
\end{aligned}$$

Hence, Theorem 11 yields that the one-step lookahead stopping time $T^*_\lambda$ defined in (21) satisfies

$$T(\mathcal{T}^0(\lambda)) = T^*_\lambda.$$

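We now show that the algorithm finds the set of break-points $\{(\alpha_m, d_m)\}_{m=0}^{M}$ and the corresponding
$\{\mathcal{T}^m\}_{m=0}^{M}$ in polynomial running time in $\kappa$. To that aim, we first show that $T^*_\lambda$ is permutation invariant.
By Lemma 7, we equivalently show that, for all $y^n$ and permutations $\pi$, if $y^n \notin \mathcal{A}_n$ then $\pi(y^n) \notin \mathcal{A}_n$.
We have for $n < \kappa$

$$\begin{aligned}
\sum_{\gamma \in \mathcal{Y}} J_\lambda(y^n\gamma) - J_\lambda(y^n)
&= \mathbb{P}(Y^n = y^n)\big(\mathbb{P}(S \le n \mid Y^n = y^n) - \lambda\,\mathbb{P}(S = n+1 \mid Y^n = y^n)\big) \\
&= \mathbb{P}(Y^n = \pi(y^n))\big(\mathbb{P}(S \le n \mid Y^n = \pi(y^n)) - \lambda\,\mathbb{P}(S = n+1 \mid Y^n = \pi(y^n))\big) \\
&= \sum_{\gamma \in \mathcal{Y}} J_\lambda(\pi(y^n)\gamma) - J_\lambda(\pi(y^n)),
\end{aligned} \tag{23}$$

where we have used Lemma 8 for the second equality. Thus $y^n \notin \mathcal{A}_n$ implies $\pi(y^n) \notin \mathcal{A}_n$, and therefore
$T^*_\lambda$ is permutation invariant. Since $T(\mathcal{T}^0(\lambda)) = T^*_\lambda$ for all $\lambda \ge 0$ by Theorem 11, all $\{\mathcal{T}^m\}_{m=0}^{M}$ are
permutation invariant. Finally, because $S$ is permutation invariant, applying Corollary 10 we conclude
that the algorithm has indeed polynomial running time in $\kappa$.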
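
For this example the sign of the bracket in (23) can be evaluated explicitly, since $\mathbb{P}(S > n \mid Y^n = y^n) = \prod_i \mathbb{P}(X_i = 0 \mid Y_i = y_i)$ and $\mathbb{P}(S = n+1 \mid Y^n = y^n) = \mathbb{P}(S > n \mid Y^n = y^n)\,\mathbb{P}(X = 1)$. The following Python sketch tests membership in the stopping set through that bracket and checks permutation invariance and nestedness by brute force. The joint pmf `JOINT`, the value of the multiplier, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
from fractions import Fraction
from itertools import permutations, product

# Hypothetical joint pmf P(X = x, Y = y) on {0,1} x {0,1}; the numbers are illustrative only.
JOINT = {(0, 0): Fraction(1, 2), (0, 1): Fraction(1, 10),
         (1, 0): Fraction(1, 10), (1, 1): Fraction(3, 10)}
P_X1 = JOINT[(1, 0)] + JOINT[(1, 1)]  # P(X = 1)

def prob_s_greater(y):
    """P(S > n | Y^n = y) for S = inf{i : X_i = 1}: every X_i equals 0 given its Y_i."""
    q = Fraction(1)
    for yi in y:
        q *= JOINT[(0, yi)] / (JOINT[(0, yi)] + JOINT[(1, yi)])
    return q

def in_stop_set(y, lam):
    """Test P(S <= n | y) >= lam * P(S = n+1 | y), i.e. a nonnegative bracket in (23)."""
    q = prob_s_greater(y)
    return 1 - q >= lam * q * P_X1

def stop_set_is_permutation_invariant(kappa, lam):
    """Membership must not change under any permutation of y, for all lengths n < kappa."""
    return all(in_stop_set(y, lam) == in_stop_set(py, lam)
               for n in range(1, kappa)
               for y in product((0, 1), repeat=n)
               for py in permutations(y))

def stop_sets_are_nested(kappa, lam):
    """If y is in the stopping set, every one-symbol extension of y must be as well."""
    return all(in_stop_set(y + (g,), lam)
               for n in range(1, kappa - 1)
               for y in product((0, 1), repeat=n) if in_stop_set(y, lam)
               for g in (0, 1))

if __name__ == "__main__":
    lam = Fraction(3)
    print(in_stop_set((1,), lam), in_stop_set((0,), lam))
    print(stop_set_is_permutation_invariant(5, lam), stop_sets_are_nested(5, lam))
```

With these illustrative numbers and multiplier 3, a single observation $y_1 = 1$ already triggers stopping, whereas $y_1 = 0$ does not.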
The problem considered in this example is actually a Bayesian change-point problem, as defined in
Example 4 in Section I. Here the change-point $\Theta = S$ has distribution $\mathbb{P}(\Theta = n) = p(1-p)^{n-1}$, where
$p = \mathbb{P}(X = 1)$. The conditional distribution of $Y_i$ given $\Theta$ is

$$\mathbb{P}(Y_i = y_i \mid \Theta = n) = \begin{cases}
\mathbb{P}(Y_i = y_i \mid X_i = 0) & \text{if } i < n, \\
\mathbb{P}(Y_i = y_i \mid X_i = 1) & \text{if } i = n, \\
\mathbb{P}(Y_i = y_i) & \text{if } i > n.
\end{cases}$$

Note that, unlike the case considered by Shiryaev (see Example 4 in Section I), the distribution of the
process at the change-point differs from the ones before and after it.

We now give an example that cannot be formulated as a change-point problem and for which the
one-step lookahead s.t. $T^*_\lambda$ minimizes the Lagrangian $J_\lambda(T)$.

Example 10. Let $\{(X_i,Y_i)\}_{i\ge 1}$ be i.i.d. where the $X_i$'s and $Y_i$'s take values in $\{0,1\}$, and let $S = \inf\{i \ge
1 : \sum_{j=1}^{i} X_j = 2\}$. A similar computation as for Example 9 reveals that if

$$\mathbb{P}(X_i = 1 \mid Y_i) \ge \mathbb{P}(X_i = 0 \mid Y_i)$$

then Theorem 11 applies, showing that the one-step lookahead stopping time $T^*_\lambda$ defined in (21) satisfies
$T(\mathcal{T}^0(\lambda)) = T^*_\lambda$.

Furthermore, since $S$ is permutation invariant, (23) shows that $T^*_\lambda$ is permutation invariant. Applying
Corollary 10, one deduces that the algorithm has polynomial running time in $\kappa$ in this case as well.
The problem considered here is not a change-point problem since, for $k > n$

$$\mathbb{P}(S = k \mid Y^n = y^n, S > n) \ne \mathbb{P}(S = k \mid S > n),$$

and therefore (1) does not hold.

VI. Remarks 

In our study, we exploited the finite tree structure of bounded stopping times defined over finite alphabet 
processes, and derived an algorithm that outputs the minimum reaction delays for tracking a stopping time 
through noisy observations, for any probability of false-alarm. This algorithm has a complexity that is 
exponential in the bound of the stopping time we want to track and, in certain cases, even polynomial. 
In comparison, an exhaustive search has a complexity that is doubly exponential. 

The conditions under which the algorithm runs in polynomial time are, unfortunately, not very explicit 
and require more study (see Corollary 10). Explicit conditions, however, are expected to be very restrictive 
on both the stochastic process and the stopping time to be tracked. 

For certain applications, it is suitable to consider stopping times defined over more general processes, 
such as continuous time over continuous alphabets. In this case, how to solve the TST problem remains 
a wide open question. As a first step, one might consider a time and alphabet quantization and apply our 
result in order to derive an approximation algorithm. 
Appendix I
Proof of Theorem 9

In the following we write $\mathcal{T}$ for $\mathcal{T}^m$. From Theorem 6, to find the $y \in \mathcal{I}(\mathcal{T})$ maximizing $g(y, \mathcal{T})$, we
only have to compute $g(y, \mathcal{T})$ for all possible compositions of $y$. The number of such compositions is
$O((\kappa+1)^{1+|\mathcal{Y}|})$. We now show that $g(y, \mathcal{T})$ can be computed in polynomial time in $\kappa$. From the proof
of Theorem 6, we have to show that $\mathbb{P}(S \le n \mid Y^n = y^n)$ can be computed in polynomial time, and that
the sums in (18) and (19) can be computed in polynomial time.

We have

$$\mathbb{P}(S \le n \mid Y^n = y^n) = \sum_{x^n \in \mathcal{X}^n} \mathbb{P}(S \le n \mid X^n = x^n)\,\mathbb{P}(X^n = x^n \mid Y^n = y^n).$$

Each term in the summation on the right hand side depends only on the composition of $(x^n, y^n)$, and
hence $\mathbb{P}(S \le n \mid Y^n = y^n)$ can be computed in polynomial time in $\kappa$.
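Consider now the sum over all $y^n\gamma \in \mathcal{L}(\mathcal{T}_{y^n})$ in (18):

$$\sum_{y^n\gamma \in \mathcal{L}(\mathcal{T}_{y^n})} a(y^n\gamma) = \sum_{y^n\gamma\gamma' \in \mathcal{L}(\mathcal{T}_{y^n})} a(y^n\gamma\gamma'). \tag{24}$$

By Lemma 7, $y^n\gamma\gamma' \in \mathcal{L}(\mathcal{T}_{y^n})$ if and only if $y^n\pi(\gamma)\gamma' \in \mathcal{L}(\mathcal{T}_{y^n})$ for all permutations $\pi$. And as $a(y^n\gamma\gamma') =
a(y^n\pi(\gamma)\gamma')$, we can compute (24) in polynomial time in $\kappa$.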

Consider next the sum over all $y^n\gamma \in \mathcal{L}(\mathcal{T}_{y^n})$ in (19). Using Lemma 8

n+J(-y)-l 
y n yeC(T y n) k=n 

= W{Y n+l{ ^ = y n j)F(S <n + l{j)\Y n+l ^ = y n j). 

Applying Lemma 7 as before, we conclude that the right-hand side can be computed in polynomial time
in $\kappa$.

It remains to prove that $\alpha_{m+1}$ and $d_{m+1}$ can be computed in polynomial time in $\kappa$ from $\alpha_m$ and
$d_m$. This follows from the same argument, as it suffices to compute the differences $b(\mathcal{T}_{y^*}) - b(y^*)$ and
$a(y^*) - a(\mathcal{T}_{y^*})$ for all $y^*$ maximizing $g(y, \mathcal{T})$.

Appendix II
Proof of Theorem 11

Fix some $\lambda \ge 0$. Let us write $J_\lambda(T)$ as $\mathbb{E}(c(Y^T))$ where

$$c(y^n) \triangleq \mathbb{E}\big((n-S)^+ \mid Y^n = y^n\big) + \lambda\,\mathbb{P}(S > n \mid Y^n = y^n).$$

We say that the $\{\mathcal{A}_n\}$ are nested if, for any $n \ge 1$ and $\gamma \in \mathcal{Y}$, we have that $y^n \in \mathcal{A}_n$ implies $y^n\gamma \in \mathcal{A}_{n+1}$.
We show that (22) implies that the $\{\mathcal{A}_n\}$ are nested, and that this in turn implies that the one-step
lookahead stopping rule is optimal. The second part of the proof is well known in the theory of optimal
stopping and is referred to as the monotone case (see, e.g., Chow et al. [19, Chapter 3]). Here we provide
an alternative proof that emphasizes the tree structure of stopping times.

Note that $y^n \in \mathcal{A}_n$ if and only if $\mathbb{E}(c(Y^{n+1}) \mid Y^n = y^n) \ge c(y^n)$. We now show that

$$\mathbb{E}(c(Y^{n+1}) \mid Y^n) \ge c(Y^n) \iff \mathbb{P}(S \le n \mid Y^n) \ge \lambda\,\mathbb{P}(S = n+1 \mid Y^n). \tag{25}$$
Since $\{(X_i,Y_i)\}_{i\ge 1}$ are i.i.d., $S$ is also a (randomized) s.t. with respect to $\{Y_i\}_{i\ge 1}$ by Lemma 8. It follows
that

$$\begin{aligned}
c(Y^{n+1}) &= \mathbb{E}\big((n+1-S)^+ \mid Y^{n+1}\big) + \lambda\,\mathbb{P}(S > n+1 \mid Y^{n+1}) \\
&= \sum_{k=1}^{n} \mathbb{P}(S \le k \mid Y^{n+1}) + \lambda\,\mathbb{P}(S > n+1 \mid Y^{n+1}) \\
&= \sum_{k=1}^{n-1} \mathbb{P}(S \le k \mid Y^n) + \mathbb{P}(S \le n \mid Y^n) + \lambda\,\mathbb{P}(S > n \mid Y^n) - \lambda\,\mathbb{P}(S = n+1 \mid Y^{n+1}) \\
&= c(Y^n) + \mathbb{P}(S \le n \mid Y^n) - \lambda\,\mathbb{P}(S = n+1 \mid Y^{n+1}),
\end{aligned}$$

from which one deduces (25).
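
The second equality in the derivation above rests on the identity $\mathbb{E}(n-S)^+ = \sum_{k=1}^{n-1} \mathbb{P}(S \le k)$ for an integer-valued $S \ge 1$. A minimal Python sketch that checks this identity on an arbitrary pmf is given below; the pmf and the function names are illustrative assumptions only.

```python
from fractions import Fraction

def expected_positive_part(pmf, n):
    """E[(n - S)^+] computed directly from the pmf of S (keys are values s >= 1)."""
    return sum(p * max(0, n - s) for s, p in pmf.items())

def sum_of_cdf(pmf, n):
    """sum_{k=1}^{n-1} P(S <= k), accumulated term by term."""
    return sum(p for k in range(1, n) for s, p in pmf.items() if s <= k)

if __name__ == "__main__":
    # Hypothetical pmf of S on {1, 2, 4}.
    pmf = {1: Fraction(1, 2), 2: Fraction(1, 4), 4: Fraction(1, 4)}
    for n in range(1, 7):
        print(n, expected_positive_part(pmf, n), sum_of_cdf(pmf, n))
```

The same identity, applied conditionally on $Y^{n+1}$, turns the expected delay term of $c(Y^{n+1})$ into a sum of conditional probabilities.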

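Next, we prove that the $\{\mathcal{A}_n\}$ are nested. By (25) this is equivalent to showing that, whenever for some
$y^n$

$$\mathbb{P}(S \le n \mid Y^n = y^n) \ge \lambda\,\mathbb{P}(S = n+1 \mid Y^n = y^n), \tag{26}$$

we also have

$$\mathbb{P}(S \le n+1 \mid Y^{n+1} = y^n\gamma) \ge \lambda\,\mathbb{P}(S = n+2 \mid Y^{n+1} = y^n\gamma) \tag{27}$$

for any $\gamma \in \mathcal{Y}$. Suppose that (26) holds for some $y^n$. Using the fact that $S$ is a s.t. with respect to the
$Y_i$'s (Lemma 8) together with the hypothesis of the theorem yields for any $\gamma$

$$\begin{aligned}
\mathbb{P}(S \le n+1 \mid Y^{n+1} = y^n\gamma) - \lambda\,\mathbb{P}(S = n+2 \mid Y^{n+1} = y^n\gamma)
&\ge \mathbb{P}(S \le n \mid Y^n = y^n) - \lambda\,\mathbb{P}(S = n+2 \mid Y^{n+1} = y^n\gamma) \\
&\ge \lambda\big(\mathbb{P}(S = n+1 \mid Y^n = y^n) - \mathbb{P}(S = n+2 \mid Y^{n+1} = y^n\gamma)\big) \\
&\ge 0,
\end{aligned}$$

and therefore (27) holds. Hence the $\{\mathcal{A}_n\}$ are nested.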

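Let $\mathcal{T}^*$ be the tree corresponding to $T^*_\lambda$. The final step is to show that if the $\{\mathcal{A}_n\}$ are nested then
$\mathcal{T}^0(\lambda) = \mathcal{T}^*$. To that aim we show that $\mathcal{I}(\mathcal{T}^*) \subset \mathcal{I}(\mathcal{T}^0(\lambda))$ and $(\mathcal{I}(\mathcal{T}^*))^c \subset (\mathcal{I}(\mathcal{T}^0(\lambda)))^c$. Pick an arbitrary
$y \in \mathcal{I}(\mathcal{T}^0)$. Using Lemma 2, we compare $J_\lambda(y)$ with $\sum_\gamma J_\lambda(\mathcal{T}^0_{y\gamma}(\lambda))$. We distinguish two cases. First
suppose that $y \in \mathcal{I}(\mathcal{T}^*)$, i.e., $J_\lambda(y) > \sum_\gamma J_\lambda(y\gamma)$. Then

$$J_\lambda(y) > \sum_{\gamma} J_\lambda(y\gamma) \ge \sum_{\gamma} J_\lambda(\mathcal{T}^0_{y\gamma}(\lambda)),$$

and hence $y \notin \mathcal{L}(\mathcal{T}^0(\lambda))$. But since the $\{\mathcal{A}_n\}$ are nested, no prefix of $y$ can be an element of $\mathcal{L}(\mathcal{T}^0(\lambda))$
and hence $y \in \mathcal{I}(\mathcal{T}^0(\lambda))$.

Second, assume $y \notin \mathcal{I}(\mathcal{T}^*)$. If $l(y) = \kappa$, then clearly $y \notin \mathcal{I}(\mathcal{T}^0(\lambda))$. If $l(y) < \kappa$, then $J_\lambda(y) \le
\sum_\gamma J_\lambda(y\gamma)$ and we now show by induction that this implies that $\mathcal{T}^0_y(\lambda) = \{y\}$. Note first that as the $\{\mathcal{A}_n\}$
are nested, we have for any $\tilde{y} \in \mathcal{I}(\mathcal{T}^0_y)$ (i.e., for any $\tilde{y}$ with prefix $y$)

$$J_\lambda(\tilde{y}) \le \sum_{\gamma \in \mathcal{Y}} J_\lambda(\tilde{y}\gamma). \tag{28}$$

Assume first that $\mathcal{T}^0_y$ has depth one. Then (28) implies by Lemma 2 that $\mathcal{T}^0_y(\lambda) = \{y\}$. Suppose then that
this is true for all $\mathcal{T}^0_y$ of depth at most $k-1$. Let $\mathcal{T}^0_y$ have depth $k$. Then by the induction hypothesis and
(28)

$$\sum_{\gamma \in \mathcal{Y}} J_\lambda(\mathcal{T}^0_{y\gamma}(\lambda)) = \sum_{\gamma \in \mathcal{Y}} J_\lambda(y\gamma) \ge J_\lambda(y),$$

and thus $\mathcal{T}^0_y(\lambda) = \{y\}$ by Lemma 2, concluding the induction step. This implies $y \notin \mathcal{I}(\mathcal{T}^0(\lambda))$.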